-
Maps characters (text) to byte sequences and vice versa. Used for text fields in serialization.
Naming Confusion
-
When someone says "custom package encoding", they usually mean:
-
A framing protocol (how message start/end is delimited).
-
A custom serialization/deserialization strategy.
-
A binary or textual format for transmitting structures over the network.
-
-
Using "encoding" for package framing strategies is technically valid but potentially ambiguous.
-
In networking, it's better to use more specific terms, such as framing, serialization, or wire format.
-
The word "encoding" itself isn't wrong but should be interpreted in the technical context.
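To make the distinction concrete, here is a minimal sketch that keeps text encoding and framing as separate steps. The 4-byte big-endian length prefix and the helper names `frame` and `unframe` are assumptions chosen for illustration, not a protocol described in these notes.

```python
import struct
from typing import Tuple


def frame(payload: bytes) -> bytes:
    """Framing: prefix the payload with its length (4-byte big-endian unsigned int)."""
    return struct.pack(">I", len(payload)) + payload


def unframe(buffer: bytes) -> Tuple[bytes, bytes]:
    """Split one framed message off the front of buffer; return (payload, rest)."""
    (length,) = struct.unpack(">I", buffer[:4])
    return buffer[4:4 + length], buffer[4 + length:]


text = "h\u00e9llo"                    # "héllo"
payload = text.encode("utf-8")         # text encoding: characters -> bytes
message = frame(payload)               # framing: marks where the message ends
decoded, rest = unframe(message + b"next")
print(decoded.decode("utf-8"), rest)
```

The text encoding decides how characters become bytes; the framing only decides where one message stops and the next begins.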
Text
UTF-8
-
Unicode Transformation Format, 8-bit
-
Size:
-
ASCII characters (0-127) use 1 byte
-
Non-ASCII characters use 2 to 4 bytes
-
For languages with many non-ASCII characters (e.g., Chinese, Japanese), it can take more space than UTF-16 (typically 3 bytes per character instead of 2)
-
-
Web standard (used by HTML, JSON, XML, etc.)
-
Backward compatible with ASCII; valid ASCII text is valid UTF-8
-
Serialization:
-
UTF-8 can be considered a form of serialization: it turns text (a sequence of Unicode code points) into a sequence of bytes
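The size rules above can be checked with a short sketch using Python's built-in `str.encode`. The sample characters are arbitrary picks, one per byte length.

```python
# One sample character for each UTF-8 length (written as escapes to avoid ambiguity).
samples = {
    "A": "A",                  # U+0041, ASCII
    "e-acute": "\u00e9",       # U+00E9, Latin-1 range
    "CJK": "\u4e2d",           # U+4E2D, Chinese character
    "emoji": "\U0001F600",     # U+1F600, outside the BMP
}

utf8_sizes = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(utf8_sizes)

# Backward compatibility: ASCII-only text produces identical bytes in both encodings.
ascii_text = "plain ASCII"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)
```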
-
UTF-16
-
Size:
-
BMP characters (Basic Multilingual Plane, U+0000 to U+FFFF) use 2 bytes
-
Characters outside BMP (e.g., emojis, historical scripts) use 4 bytes (surrogate pairs)
-
More compact than UTF-8 for text dominated by BMP characters that need 3 bytes in UTF-8 (e.g., many Asian languages)
-
-
Widely used in some APIs and programming languages (e.g., Java, Windows, .NET)
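A small sketch of the UTF-16 size rules, using Python's `utf-16-be` codec (explicit byte order, so no byte order mark is added). The helper name `utf16_code_units` is made up for this example.

```python
def utf16_code_units(text: str) -> list:
    """Return the 16-bit code units of text as integers."""
    data = text.encode("utf-16-be")  # big-endian, no BOM prepended
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


bmp_units = utf16_code_units("\u4e2d")        # BMP character: one code unit (2 bytes)
emoji_units = utf16_code_units("\U0001F600")  # outside BMP: surrogate pair (4 bytes)
print([hex(u) for u in bmp_units], [hex(u) for u in emoji_units])

# Size comparison for a short CJK string (two BMP characters).
cjk = "\u4e2d\u6587"
utf8_len = len(cjk.encode("utf-8"))
utf16_len = len(cjk.encode("utf-16-be"))
print(utf8_len, utf16_len)
```

Note that `"...".encode("utf-16")` without an explicit byte order prepends a 2-byte BOM, which would skew the size comparison.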
UTF-32
-
Size: every code point uses exactly 4 bytes, making manipulation and indexing easier (code point i starts at byte 4 × i)
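The fixed width is what makes indexing simple. A minimal sketch, assuming big-endian UTF-32 without a BOM; `code_point_at` is a hypothetical helper.

```python
def code_point_at(data: bytes, index: int) -> str:
    """Read the code point at position index from UTF-32-BE bytes."""
    start = 4 * index  # fixed width: no scanning needed to find the offset
    return chr(int.from_bytes(data[start:start + 4], "big"))


text = "a\u4e2d\U0001F600z"          # ASCII, CJK, emoji, ASCII
data = text.encode("utf-32-be")      # explicit byte order, so no BOM is added
third = code_point_at(data, 2)
print(len(data), third == "\U0001F600")
```

The trade-off is space: ASCII-heavy text takes four times as many bytes as in UTF-8.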
ASCII
-
American Standard Code for Information Interchange
-
Legacy system compatibility: For old systems or devices that only support ASCII
-
Simple English text: When text contains only basic characters (A-Z letters, 0-9 digits, basic punctuation)
-
Simplicity: ASCII is a 7-bit code stored as exactly 1 byte per character, simplifying processing in very basic systems
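A brief sketch of what "ASCII only" means in practice: every byte value stays below 0x80, and anything else cannot be encoded. `is_ascii` is an illustrative helper, not a standard function.

```python
def is_ascii(data: bytes) -> bool:
    """True if every byte fits in 7 bits (0x00 to 0x7F)."""
    return all(byte < 0x80 for byte in data)


encoded = "Hello, 123!".encode("ascii")   # one byte per character
print(len(encoded), is_ascii(encoded))

# A non-ASCII character cannot be represented and raises an error.
try:
    "caf\u00e9".encode("ascii")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
print(encode_failed)
```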
Base64
-
Is a way to represent arbitrary binary data using only printable ASCII characters.
-
It is not encryption or compression; it is just an encoding so binary data can be stored safely in text formats.
-
It is called Base64 because the encoding uses a numeral system with 64 distinct symbols to represent data.
-
Each Base64 character encodes 6 bits.
-
So you need exactly 64 symbols (2^6 = 64) to represent every possible 6-bit value.
-
-
Base64 exists because many systems historically handled text only.
-
Raw binary can contain:
-
null bytes (0x00)
-
control characters
-
non-printable bytes
-
-
Converts binary → safe text using only:
A-Z a-z 0-9 + /
-
Padding uses =, but it is not part of the base.
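As a sanity check on the alphabet, this sketch builds the 64 symbols in order and compares them with what Python's standard `base64` module produces when every 6-bit value from 0 to 63 is encoded once. The packing trick is just one convenient way to run the check.

```python
import base64
import string

# Standard alphabet in index order: A-Z, a-z, 0-9, then "+" and "/".
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
print(len(ALPHABET))

# Pack the values 0..63 as consecutive 6-bit groups: 64 * 6 = 384 bits = 48 bytes.
bits = "".join(format(value, "06b") for value in range(64))
all_values = int(bits, 2).to_bytes(48, "big")

# 48 is divisible by 3, so the output has no "=" padding.
encoded = base64.b64encode(all_values).decode("ascii")
print(encoded == ALPHABET)
```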
Size
-
Base64 increases size by about 33%.
-
4 output characters per 3 input bytes.
encoded_size = ceil(input_size / 3) * 4 (with padding)
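The size formula can be checked against Python's standard `base64` module. This is a quick sketch; `encoded_size` simply mirrors the formula above.

```python
import base64
import math


def encoded_size(input_size: int) -> int:
    """Padded Base64 output length for input_size bytes."""
    return math.ceil(input_size / 3) * 4


# Compare the formula with the real encoder for inputs of 0..199 zero bytes.
mismatches = [
    n for n in range(200)
    if len(base64.b64encode(bytes(n))) != encoded_size(n)
]
print(mismatches)

# Relative overhead for a 3000-byte input: 4000 / 3000 - 1, about 33%.
overhead = encoded_size(3000) / 3000 - 1
print(round(overhead * 100, 1))
```

For very small inputs the padding makes the overhead larger than 33%: a single byte becomes 4 characters.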
-
"Why on earth would you use an encoding that increases the size of the thing?"
-
Because Base64 solves transport and compatibility problems, not size efficiency. It is used when binary must safely travel through systems that are text-only or text-fragile.
-
Raw binary can break many pipelines due to:
-
null bytes (0x00)
-
control characters
-
encoding assumptions (UTF-8/UTF-16)
-
line-ending conversions
-
legacy text parsers
-
-
Historically (and still today), many formats and tools expect text, not arbitrary bytes.
-
Base64 guarantees the data contains only safe printable ASCII.
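A typical case is carrying bytes inside a text format. The sketch below puts arbitrary bytes into JSON, which has no binary type; the field name `payload` and the sample bytes are invented for illustration.

```python
import base64
import json

# Bytes that a text pipeline could mangle: null, 0xFF, newline, non-ASCII.
raw = bytes([0x00, 0xFF, 0x0A, 0x80])

# Encode to printable ASCII before placing the data in a JSON string.
document = json.dumps({"payload": base64.b64encode(raw).decode("ascii")})
print(document)

# The receiver reverses the steps and gets the exact original bytes back.
restored = base64.b64decode(json.loads(document)["payload"])
print(restored == raw)
```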
-
Core idea
-
Base64 works in 6-bit chunks.
-
Binary bytes are 8 bits each
-
Base64 symbols encode 6 bits each
-
So it repacks data
3 bytes (24 bits) → 4 Base64 characters
-
as
3 × 8 = 24 bits
4 × 6 = 24 bits
-
Example:
-
Input:
"Man"
-
Write bytes:
M = 01001101
a = 01100001
n = 01101110
010011010110000101101110
-
Split into 6-bit groups:
010011 010110 000101 101110
-
In decimal:
19 22 5 46
-
Map to Base64 alphabet:
0-25 → A-Z
26-51 → a-z
52-61 → 0-9
62 → +
63 → /
-
Output:
19 → T
22 → W
5 → F
46 → u
-
Concatenated:
TWFu
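The repacking in the "Man" example can be sketched with plain bit shifts. `encode_block` is a hypothetical helper that handles exactly one 3-byte block and no padding.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode_block(block: bytes) -> str:
    """Encode exactly 3 bytes (24 bits) as 4 Base64 characters."""
    if len(block) != 3:
        raise ValueError("this sketch only handles full 3-byte blocks")
    value = int.from_bytes(block, "big")  # the 24 bits as one integer
    # Take 6 bits at a time, from the most significant group to the least.
    indexes = [(value >> shift) & 0x3F for shift in (18, 12, 6, 0)]
    return "".join(ALPHABET[i] for i in indexes)


block = b"Man"
indexes = [(int.from_bytes(block, "big") >> shift) & 0x3F for shift in (18, 12, 6, 0)]
print(indexes)               # [19, 22, 5, 46]
print(encode_block(block))   # TWFu
```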
-
Padding rules
-
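If input length is not divisible by 3, Base64 pads with =: one = when 2 bytes remain, two (==) when 1 byte remains.
-
Example:
-
Input:
"Ma"
-
Write binary:
01001101 01100001
-
Split into 6-bit groups (the last group is filled with two zero bits):
010011 010110 000100
-
In decimal:
19 22 4 → T W E
-
The fourth position carries no data, so it becomes the padding character =.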
-
Output:
TWE=
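Putting the padding rule together with the 6-bit repacking gives a complete encoder. This is a teaching sketch under the rules described above, not a replacement for the standard library; `b64encode_sketch` is a made-up name, and the result is compared against Python's `base64.b64encode`.

```python
import base64
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def b64encode_sketch(data: bytes) -> str:
    """Encode bytes as padded Base64, 3 input bytes at a time."""
    out = []
    for start in range(0, len(data), 3):
        chunk = data[start:start + 3]
        missing = 3 - len(chunk)  # 0, 1, or 2 bytes short of a full block
        # Fill the block with zero bytes so it can be treated as 24 bits.
        value = int.from_bytes(chunk + b"\x00" * missing, "big")
        symbols = [ALPHABET[(value >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        if missing:
            # Positions that hold only filler bits are replaced by "=".
            symbols[-missing:] = ["="] * missing
        out.append("".join(symbols))
    return "".join(out)


for sample in (b"Man", b"Ma", b"M"):
    print(sample, b64encode_sketch(sample))

matches_stdlib = all(
    b64encode_sketch(bytes(range(n))) == base64.b64encode(bytes(range(n))).decode("ascii")
    for n in range(64)
)
print(matches_stdlib)
```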
-